
feat: simulation suite runner (npm run sim)#18

Open
dhruva-reddy wants to merge 1 commit into dhruva-reddy/feat/validate-command from dhruva-reddy/feat/sim-runner

Conversation

@dhruva-reddy
Contributor

## ELI5

**Problem.** The engine could *create* simulation suites and track
them in state, and AGENTS.md described `simulations/suites/` as a
first-class resource type. But there was no `npm run` command to
actually *execute* a suite. `npm run eval` exists but runs the
*legacy* `/evals` endpoint — a different thing — and the naming
overlap actively misled engineers into running the wrong command. To
fire a simulation suite from the CLI you had to write raw curl or go
to the dashboard UI (losing reproducibility).

**What this fix does.** Adds `npm run sim`. Two shapes:

```
npm run sim -- <org> --suite <name> --target <assistant-or-squad>
npm run sim -- <org> --simulations <n1>,<n2> --target <assistant>
```

Resolves local resource names → state-file UUIDs the same way
`npm run call` does, POSTs `/eval/simulation/run`, polls the run
status, prints a summary table (pass/fail per simulation, mean run
time, structured-output evals).

**Outcome you'll notice.** Simulation suites become a normal part of
the gitops workflow: author the suite as YAML, push it via
`npm run push`, run it via `npm run sim`. No more dashboard
clicking. Note the AGENTS.md call-out clarifying the difference
between `npm run sim` (unified `/eval/simulation/*`) and
`npm run eval` (legacy `/evals`) — renaming `eval` to disambiguate
is a separate, backwards-incompatible follow-up.
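The resolve → POST → poll → summarize flow could be sketched like this. A minimal sketch only: `pollRun`, its signature, and the `status` field are illustrative assumptions, not the actual `src/sim.ts` API; only the endpoint path in the comment comes from the description above.

```typescript
// Poll-until-done loop for a simulation run (hypothetical helper).
// The status getter would wrap GET /eval/simulation/run/:id in practice.
type RunStatus = { status: "queued" | "running" | "completed" | "failed" };

async function pollRun(
  getStatus: (runId: string) => Promise<RunStatus>,
  runId: string,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<RunStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const run = await getStatus(runId);
    // Terminal states end the loop; anything else waits and retries.
    if (run.status === "completed" || run.status === "failed") return run;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`run ${runId} did not finish after ${maxAttempts} polls`);
}
```

Injecting the status getter (rather than calling the HTTP client directly) keeps the loop trivially testable without a live API.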


---

Engine fully tracks simulation suites in state and AGENTS.md describes
simulations/suites/ as a first-class resource type, but there's no
npm run command to actually execute one. npm run eval runs the legacy
/evals endpoint, not the unified simulation runner. Customers go to
the dashboard UI to trigger runs (losing reproducibility) or write
per-customer shell wrappers.

- src/sim.ts (NEW): runSimulationSuite + runSimulationsByName helpers.
  Resolves local-name → UUID via state file; POSTs /eval/simulation/run;
  polls /eval/simulation/run/:id until completion; prints pass/fail
  summary per simulation with mean run time + structured-output evals.
  Reuses src/api.ts:vapiRequest for HTTP and the local-name → UUID
  resolution pattern from src/eval.ts.
- src/sim-cmd.ts (NEW): CLI entry. Args:
    npm run sim -- <org> --suite <name> --target <assistant-or-squad>
    npm run sim -- <org> --simulations <n1>,<n2> --target <assistant>
    npm run sim -- <org> --suite <name> --watch
- package.json: sim script.
- AGENTS.md: document npm run sim alongside npm run eval (call out the
  legacy /evals vs unified /eval/simulation/* distinction).
- tests/sim.test.ts: arg parsing, UUID resolution, status polling,
  summary table formatting.

Note: renaming npm run eval to disambiguate is a follow-up — that's a
backwards-incompatible script-name change. For now the AGENTS.md note
calls out the distinction.
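The local-name → UUID resolution the bullets describe might look roughly like this. The state-file shape (`simulationSuites` keyed by name) and the error wording are assumptions for illustration; the real format lives in the repo's state file.

```typescript
// Resolve a suite's local name to its UUID via the state file
// (hypothetical state shape; the engine's actual schema may differ).
import { readFileSync } from "fs";

type StateFile = { simulationSuites?: Record<string, { id: string }> };

function resolveSuiteId(statePath: string, suiteName: string): string {
  const state: StateFile = JSON.parse(readFileSync(statePath, "utf8"));
  const entry = state.simulationSuites?.[suiteName];
  if (!entry) {
    // Fail loudly with the name and path so the fix is obvious.
    throw new Error(`suite "${suiteName}" not found in ${statePath}`);
  }
  return entry.id;
}
```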

Closes improvements.md #16.

🤖 Generated with Claude Code


dhruva-reddy commented May 1, 2026

dhruva-reddy added a commit that referenced this pull request May 2, 2026
**Problem.** The Vapi API rejects bad configs at PATCH time with terse
400s ("property speed should not exist") — and by then the push has
already partially completed against other resources. We watched the
same five classes of mistake hit production over and over:

  1. Assistant names (or eval names) longer than 40 chars (silent cap).
  2. Structured-output ↔ assistant lockstep mismatch — one side declares
     the relationship, the other doesn't, dashboard ends up inconsistent.
  3. Prompts duplicated by paste-on-top dashboard edits (10kB prompt
     with two identical headers stacked, agent follows both).
  4. `maxTokens` set lower than the JSON-schema size of the attached
     tools' arguments — assistant looks fine on push, bricks on first
     tool-using call.
  5. Voice fields nested wrong for the provider (`voice.speed` on
     Cartesia, where it lives at `voice.generationConfig.speed`).

**What this fix does.** Five client-side validators, all running off
the same `LoadedResources` shape that `push.ts` would actually ship —
so the lint runs against exactly what would be pushed, no separate
parser to drift. Surfaces as warnings by default (one bad spec doesn't
block an otherwise-good push); promote to abort with `--strict`. Run
standalone via `npm run validate -- <org>`.

**Outcome you'll notice.** Most schema-class mistakes get caught
locally in seconds instead of mid-push 400s. Voice provider field
mismatch gets a specific message pointing at the right path. CI can
add `npm run push -- <env> --strict` as a gate before any deploy.

---

Catch the classes of errors that today only surface when the API returns
a 400 mid-push. The push pipeline runs validation in warn-only mode by
default; --strict promotes errors to a blocking abort before any API
call. Standalone runner via `npm run validate -- <org>`.

Validators implemented:

1. Name length cap (40 chars). Walks every assistant.name and every
   evaluations[].structuredOutput.name in scenarios. Closes #18.
2. SO ↔ assistant bidirectional lockstep. For every SO file's
   assistant_ids, checks the named assistant's structuredOutputIds
   mirrors it; reverse direction too. Closes #11.
3. Prompt duplication heuristics. Same H1 heading appearing twice,
   repeated CONTINUITY ON ENTRY / CLOSEOUT FLOW STRUCTURE blocks.
   Partial fix for #8 (paste-on-top dashboard duplications).
4. maxTokens floor for tool-using assistants. Computes
   floor ≈ 25 + sum(len(JSON.stringify(tool.function.parameters)))
   per attached tool. Warns under floor. Closes #19.
5. Per-provider voice schema. Cartesia rejects top-level speed /
   stability / similarityBoost / enableSsmlParsing (point at
   generationConfig.* / drop the field). 11labs rejects
   generationConfig (it's a Cartesia path). Closes #9 (engine half).
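Validator 4's floor can be transcribed directly from the formula above. The tool shape here mirrors the OpenAI-style function-tool layout the formula implies, and the constant 25 is taken as given; both are assumptions about the real `src/validate.ts`.

```typescript
// floor ≈ 25 + sum(len(JSON.stringify(tool.function.parameters)))
// over the attached tools, per the formula in the description.
type Tool = { function: { parameters: unknown } };

function maxTokensFloor(tools: Tool[]): number {
  return (
    25 +
    tools.reduce(
      (sum, tool) => sum + JSON.stringify(tool.function.parameters).length,
      0,
    )
  );
}
```

An assistant whose `maxTokens` sits below this floor would draw a warning, since the model could not even echo the tool-argument schemas in one completion.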

- src/validate.ts (NEW): validateResources(loadedResources) returning
  ValidationFinding[] with severity / type / resourceId / rule / message
  / fieldPath. Pure data; safe to test directly.
- src/validate-cmd.ts (NEW): CLI entry. Loads same resource shape as
  push.ts so the lint runs against exactly what would ship. Exit non-zero
  on any error finding.
- src/config.ts: --strict flag.
- src/push.ts: validators run in default-warn mode; --strict aborts.
- package.json: validate script.
- AGENTS.md: document npm run validate and --strict.
- tests/validate.test.ts: per-rule fixtures (golden + bad inputs)
  covering all five checks.
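A sketch of the `ValidationFinding` shape plus the simplest rule, the 40-char name cap. The field names follow the bullet above; the rule name, messages, and default-warning choice are illustrative, not the actual implementation.

```typescript
// Pure-data finding record, per the src/validate.ts bullet above.
type ValidationFinding = {
  severity: "warning" | "error";
  type: string;        // resource kind, e.g. "assistant"
  resourceId: string;
  rule: string;
  message: string;
  fieldPath: string;
};

const NAME_CAP = 40; // the API silently caps names at 40 chars

function checkNameLength(
  assistants: { id: string; name: string }[],
): ValidationFinding[] {
  return assistants
    .filter((a) => a.name.length > NAME_CAP)
    .map((a): ValidationFinding => ({
      severity: "warning", // default-warn; --strict would promote this
      type: "assistant",
      resourceId: a.id,
      rule: "name-length-cap",
      message: `name is ${a.name.length} chars; cap is ${NAME_CAP}`,
      fieldPath: "name",
    }));
}
```

Returning plain data (no I/O, no process exit) is what makes the per-rule fixtures in `tests/validate.test.ts` straightforward to assert against.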

Closes improvements.md #11, #18, #19. Resolves engine half of #9.
Partial #8, #20 (heuristic only).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
dhruva-reddy force-pushed the dhruva-reddy/feat/validate-command branch from cb8079a to b1f91f7 on May 2, 2026 01:21
dhruva-reddy force-pushed the dhruva-reddy/feat/sim-runner branch from 4e55f1f to 346fbf7 on May 2, 2026 01:22
dhruva-reddy force-pushed the dhruva-reddy/feat/validate-command branch from b1f91f7 to 3558d10 on May 2, 2026 01:27
dhruva-reddy force-pushed the dhruva-reddy/feat/sim-runner branch from 346fbf7 to 7e5eb7f on May 2, 2026 01:27
dhruva-reddy force-pushed the dhruva-reddy/feat/validate-command branch from 3558d10 to bcd23de on May 2, 2026 01:31
dhruva-reddy force-pushed the dhruva-reddy/feat/validate-command branch from bcd23de to 5cb218f on May 2, 2026 01:31
dhruva-reddy force-pushed the dhruva-reddy/feat/sim-runner branch from 7e5eb7f to ffa91d9 on May 2, 2026 01:31